Whats Covered
- Mutating joins
- Filtering joins and set operations
- Assembling data
- Advanced joining
- Case Study
Aditional Resources
Filtering joins and set operations
- Filtering joins return a copy of the dataset that has been filtered, not augmented (as with mutating joins)
Semi-joins
Apply a semi-join
As you saw, semi-joins provide a concise way to filter data from the first dataset based on information in a second dataset.
For example, the code in the editor uses
semi_join()to create a data frame of the artists inartistswho have written a song insongs.# View the output of semi_join() new_artists <- artists %>% semi_join(songs, by = c("first", "last")) # Create the same result new_artists <- artists %>% right_join(songs, by = c("first", "last")) %>% filter(!is.na(instrument)) %>% select(first, last, instrument) new_artists %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>% row_spec(0, bold = T, color = "white", background = "#3f7689")first last instrument John Lennon Guitar Paul McCartney Bass Tom Jones Vocals You can see that a single semi-join is the equivalent of three separate operations.
Exploring with semi-joins
Semi-joins provide a useful way to explore the connections between multiple tables of data.
For example, you can use a
semi-jointo determine the number of albums in thealbumsdataset that were made by a band in thebandsdataset.albums %>% # Collect the albums made by a band semi_join(bands, by = c("band")) %>% # Count the albums made by a band nrow() %>% # Create pretty table kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>% row_spec(0, bold = T, color = "white", background = "#3f7689")x 5 A more precise way to filter?
- You can use semi_joins with a df of date you want to match to easily filter
tracks## # A tibble: 21 x 6 ## track band label year first last ## <chr> <chr> <chr> <int> <chr> <chr> ## 1 Can't Buy Me Love The Beatles Parlopho~ 1964 Paul McCart~ ## 2 I Feel Fine The Beatles Parlopho~ 1964 John Lennon ## 3 A Hard Day's Night The Beatles Parlopho~ 1964 John Lennon ## 4 Sound of Silence Simon and Garf~ Columbia 1964 Paul Simon ## 5 Help! The Beatles Parlopho~ 1965 John Lennon ## 6 Ticket to Ride The Beatles Parlopho~ 1965 John Lennon ## 7 I am a Rock Simon and Garf~ Columbia 1965 Paul Simon ## 8 Yellow Submarine / Elean~ The Beatles Parlopho~ 1966 Paul McCart~ ## 9 Homeward Bound Simon and Garf~ Columbia 1966 Paul Simon ## 10 Scarborough Fair Simon and Garf~ Columbia 1966 unkno~ unknown ## # ... with 11 more rowsmatches## # A tibble: 3 x 3 ## band year first ## <chr> <int> <chr> ## 1 The Beatles 1964 Paul ## 2 The Beatles 1965 John ## 3 Simon and Garfunkel 1966 Paul# With semi-join tracks %>% semi_join(matches,by = c("band", "year", "first")) %>% # Create pretty table kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>% row_spec(0, bold = T, color = "white", background = "#3f7689")track band label year first last Can’t Buy Me Love The Beatles Parlophone 1964 Paul McCartney Help! The Beatles Parlophone 1965 John Lennon Ticket to Ride The Beatles Parlophone 1965 John Lennon Homeward Bound Simon and Garfunkel Columbia 1966 Paul Simon # With dply filter statement tracks %>% filter( (band == "The Beatles" & year == 1964 & first == "Paul") | (band == "The Beatles" & year == 1965 & first == "John") | (band == "Simon and Garfunkel" & year == 1966 & first == "Paul") ) %>% # Create pretty table kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>% row_spec(0, bold = T, color = "white", background = "#3f7689")track band label year first last Can’t Buy Me Love The Beatles Parlophone 1964 Paul McCartney Help! The Beatles Parlophone 1965 John Lennon Ticket to Ride The Beatles Parlophone 1965 John Lennon Homeward Bound Simon and Garfunkel Columbia 1966 Paul Simon The filter does return the same observations as the semi-join, but it takes a bit of work to set things up properly, as you can see.
Anti-joins
- Full list of joins available; anti-join can be used to see which rows will not be matched to a second dataset by a join.
Apply an anti-join
Anti-joins provide a useful way to reason about how a mutating join will work before you apply the join.
# Return rows of artists that don't have bands info artists %>% anti_join(bands, by = c("first", "last")) %>% # Create pretty table kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>% row_spec(0, bold = T, color = "white", background = "#3f7689")first last instrument Tom Jones Vocals Davy Jones Vocals Joe Perry Guitar Elvis Presley Vocals Paul Simon Guitar Joe Walsh Guitar Brian Wilson Vocals Nancy Wilson Vocals Have you noticed that correct keys are as important to filtering joins as they were to mutating joins?
Apply another anti-join
Anti-joins with
anti_join()also provide a great way to diagnose joins that go wrong.For example, they can help you zero-in on rows that have capitalization or spelling errors in the keys. These things will make your primary and secondary keys appear different to R, even though you know they refer to the same thing.
# Check whether album names in labels are mis-entered labels %>% anti_join(albums, by = c("album"))## # A tibble: 1 x 2 ## album label ## <chr> <chr> ## 1 A Hard Days Night ParlophoneYou can think of anti-join as a debugging tactic for joins. The next exercise will help you decide when to use
anti_join()and when to usesemi_join().Which filtering join?
Think you have filtering joins down? Let’s check.
Which filtering join would you use to determine how many rows in
songsmatch a label inlabels?# Determine which key joins labels and songs labels %>% # Create pretty table kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>% row_spec(0, bold = T, color = "white", background = "#3f7689")album label Abbey Road Apple A Hard Days Night Parlophone Magical Mystery Tour Parlophone Led Zeppelin IV Atlantic The Dark Side of the Moon Harvest Hotel California Asylum Rumours Warner Brothers Aerosmith Columbia Beggar’s Banquet Decca songs %>% # Create pretty table kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>% row_spec(0, bold = T, color = "white", background = "#3f7689")song album first last Come Together Abbey Road John Lennon Dream On Aerosmith Steven Tyler Hello, Goodbye Magical Mystery Tour Paul McCartney It’s Not Unusual Along Came Jones Tom Jones # Check your understanding songs %>% # Find the rows of songs that match a row in labels semi_join(labels, by = c("album")) %>% # Number of matches between labels and songs nrow() %>% # Create pretty table kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>% row_spec(0, bold = T, color = "white", background = "#3f7689")x 3 Set operations
unionwill return every row that appears in one or more of the datasets- If a row appears multiple times
unionwill only return it once
- If a row appears multiple times
interesectwill return only the rows that appear in both datasetssetdiffwill return the rows that appear in the first dataset but not the second
How many songs are there?
We have loaded two datasets in your workspace,
aerosmithandgreatest_hits, each of which represents an album from the band Aerosmith. Each row in either of the datasets is a song on that album.How many unique songs do these two albums contain in total?
aerosmith %>% # Create the new dataset using a set operation union(greatest_hits) %>% # Count the total number of songs nrow() %>% # Create pretty table kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>% row_spec(0, bold = T, color = "white", background = "#3f7689")x 24 The
union()function provides an easy way to combine two datasets without duplicating any values.Greatest hits
Which songs from Aerosmith made it onto Greatest Hits?
# Create the new dataset using a set operation aerosmith %>% intersect(greatest_hits) %>% # Create pretty table kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>% row_spec(0, bold = T, color = "white", background = "#3f7689")song length Dream On 04:28:00 You can think of
intersect()as the set operator equivalent of asemi-join. It is what you would use if your datasets contain the exact same variables.Live! Bootleg songs
livecontains the songs from a third Aerosmith album, Live! Bootleg.Which songs are on Live! Bootleg but not on Greatest Hits? Notice that the length of songs may be different when they are performed live.
# Select the song names from live live_songs <- live %>% select(song) # Select the song names from greatest_hits greatest_songs <- greatest_hits %>% select(song) # Create the new dataset using a set operation live_songs %>% setdiff(greatest_songs) %>% # Create pretty table kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>% row_spec(0, bold = T, color = "white", background = "#3f7689")song Lord of the Thighs Toys in the Attic Sick as a Dog Sight for Sore Eyes S.O.S. (Too Bad) I Ain’t Got You Mother Popcorn/Draw the Line Train Kept A-Rollin’/Strangers in the Night Multiple operations
There is no set operation to find rows that appear in one data frame or another, but not both. However, you can accomplish this by combining set operators.
Can you think of a combination that would answer the question, “Which songs appear on one of Live! Bootleg or Greatest Hits, but not both?”
- You’ll need
setdiff(),intersect(), andunion(), but not necessarily in that order.
# Select songs from live and greatest_hits live_songs <- live %>% select(song) greatest_songs <- greatest_hits %>% select(song) # Find songs in at least one of live_songs and greatest_songs all_songs <- union(live_songs, greatest_songs) # Find songs in both common_songs <- intersect(live_songs, greatest_songs) # Find songs that only exist in one dataset setdiff(all_songs, common_songs) %>% # Create pretty table kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>% row_spec(0, bold = T, color = "white", background = "#3f7689")song Lord of the Thighs Toys in the Attic Sick as a Dog Sight for Sore Eyes S.O.S. (Too Bad) I Ain’t Got You Mother Popcorn/Draw the Line Train Kept A-Rollin’/Strangers in the Night Same Old Song and Dance Seasons of Winter Big Ten Inch Record Draw the Line Kings and Queens Remember (Walking in the Sand) Lightning Strikes Sweet Emotion (remix) One Way Street (live) Uncommonly good uncovering of uncommon songs! Working with set operators is fairly simple; it doesn’t get much trickier than this. However, set operators do have one type of odd behavior, which you will learn about in the next exercise.
Unique values
eagles1974contains five rows, one for each member of the Eagles (a band) in 1974. Likewise,eagles1971contains four rows, one for each member of the band in 1971.Examine the contents of each dataset, then determine how many rows will be in the union of
eagles1974witheagles1971. Does something surprising happen here?eagles1974 eagles1971 union(eagles1974, eagles1971)How can a
unionreturn fewer values than are in the original dataset? Theunionremoves the duplicate row for Don. The rows represent two different Dons, Don Felder and Don Henley, but there’s nothing in eagles1974 to indicate this.Comparing datasets
- Its common to want to know if one data set is the same as another dataset
- dplyr’s
setequalwill do this easily - base R’s
identicalis will only return true if the datasets have the exact same rows in the exact same order
Recap:
Apply setequal
definitiveandcompletecontain the songs that appear in competing Led Zeppelin anthologies: The Definitive Collection and The Complete Studio Recordings, respectively.Both anthologies claim to contain the complete studio recordings of Led Zeppelin, but do the anthologies contain the same exact songs?
# Check if same order: definitive and complete identical(definitive, complete)## [1] FALSE# Check if any order: definitive and complete setequal(definitive, complete)## [1] FALSE# Songs in definitive but not complete setdiff(definitive, complete)## # A tibble: 15 x 2 ## song album ## <chr> <chr> ## 1 Rock and Roll The Song Remains the Same ## 2 Celebration Day The Song Remains the Same ## 3 Black Dog The Song Remains the Same ## 4 Over the Hills and Far Away The Song Remains the Same ## 5 Misty Mountain Hop The Song Remains the Same ## 6 Since I've Been Loving You The Song Remains the Same ## 7 No Quarter The Song Remains the Same ## 8 The Song Remains the Same The Song Remains the Same ## 9 The Rain Song The Song Remains the Same ## 10 The Ocean The Song Remains the Same ## 11 Dazed and Confused The Song Remains the Same ## 12 Stairway to Heaven The Song Remains the Same ## 13 Moby Dick The Song Remains the Same ## 14 Heartbreaker The Song Remains the Same ## 15 Whole Lotta Love The Song Remains the Same# Songs in complete but not definitive setdiff(complete, definitive)## # A tibble: 0 x 2 ## # ... with 2 variables: song <chr>, album <chr>The Definitive Collection contains the soundtrack for The Song Remains the Same, a documentary not featured in The Complete Studio Recordings.
Apply setequal again
A few exercises ago, you saw that an
intersect()is analagous to asemi_join()when two datasets contain the same variables and each variable is used in the key.Under these conditions,
setdiff()is also analagous to one of the filtering joins.# Return songs in definitive that are not in complete definitive %>% anti_join(complete, by = c("song", "album")) %>% # Create pretty table head(10) %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>% row_spec(0, bold = T, color = "white", background = "#3f7689")song album Rock and Roll The Song Remains the Same Celebration Day The Song Remains the Same Black Dog The Song Remains the Same Over the Hills and Far Away The Song Remains the Same Misty Mountain Hop The Song Remains the Same Since I’ve Been Loving You The Song Remains the Same No Quarter The Song Remains the Same The Song Remains the Same The Song Remains the Same The Rain Song The Song Remains the Same The Ocean The Song Remains the Same # Return songs in complete that are not in definitive complete %>% anti_join(definitive, by = c("song", "album")) %>% # Create pretty table head(10) %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>% row_spec(0, bold = T, color = "white", background = "#3f7689")song album When your datasets contain the same variables, a
setdiff()does the same thing as ananti_join()that uses every column as a key.Comparing albums
It appears that The Definitive Collection contains songs from the soundtrack of The Song Remains the Same, a movie filmed during a live Led Zeppelin concert. Is this the only difference between The Definitive Collection and The Complete Studio Recordings?
nrow(complete)## [1] 85nrow(definitive)## [1] 100# Get union of complete and soundtrack complete_and_soundtrack <- union(complete, soundtrack) # Check if same, including order: definitive and union of complete and soundtrack complete_and_soundtrack %>% identical(definitive)## [1] FALSE# Check if same, rows in any order: definitive and union of complete and soundtrack complete_and_soundtrack %>% setequal(definitive)## [1] TRUESession info
sessionInfo()## R version 3.5.2 (2018-12-20) ## Platform: x86_64-w64-mingw32/x64 (64-bit) ## Running under: Windows 10 x64 (build 16299) ## ## Matrix products: default ## ## locale: ## [1] LC_COLLATE=German_Switzerland.1252 LC_CTYPE=German_Switzerland.1252 ## [3] LC_MONETARY=German_Switzerland.1252 LC_NUMERIC=C ## [5] LC_TIME=German_Switzerland.1252 ## ## attached base packages: ## [1] stats graphics grDevices utils datasets methods base ## ## other attached packages: ## [1] ggplot2_3.1.0 dplyr_0.8.0.1 gapminder_0.3.0 kableExtra_1.0.1 ## [5] knitr_1.21 ## ## loaded via a namespace (and not attached): ## [1] Rcpp_1.0.0 highr_0.7 plyr_1.8.4 ## [4] pillar_1.3.1 compiler_3.5.2 prettydoc_0.2.1 ## [7] tools_3.5.2 digest_0.6.18 gtable_0.2.0 ## [10] evaluate_0.12 tibble_2.0.1 viridisLite_0.3.0 ## [13] pkgconfig_2.0.2 rlang_0.3.1 cli_1.0.1 ## [16] rstudioapi_0.9.0 yaml_2.2.0 xfun_0.4 ## [19] withr_2.1.2 httr_1.4.0 stringr_1.4.0 ## [22] xml2_1.2.0 hms_0.4.2 webshot_0.5.1 ## [25] grid_3.5.2 tidyselect_0.2.5 glue_1.3.0 ## [28] R6_2.4.0 fansi_0.4.0 rmarkdown_1.11 ## [31] readr_1.3.1 purrr_0.3.0 magrittr_1.5 ## [34] codetools_0.2-15 scales_1.0.0 htmltools_0.3.6 ## [37] assertthat_0.2.0 rvest_0.3.2 colorspace_1.4-0 ## [40] utf8_1.1.4 stringi_1.3.1 lazyeval_0.2.1 ## [43] munsell_0.5.0 crayon_1.3.4
A work by Guillaume Abgrall
guillaume.abgrall@hotmail.fr